Approximate TF-IDF based on topic extraction from massive message stream using the GPU

نویسندگان

  • Ugo Erra
  • Sabrina Senatore
  • Fernando Minnella
  • Giuseppe Caggianese
چکیده

The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage the large amounts of data stream, especially in specific domains of application such as critical infrastructure systems, sensor networks, log file analysis, search engines and more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data stream, i.e., process data as it becomes available and provide an accurate response, based solely on the data stream that has already been provided. Data retrieval techniques often require traditional data storage and processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which can evaluate how important a word is in a collection of documents and requires to a priori know the whole dataset. To address this problem, we propose an approximate version of the TF–IDF measure suitable to work on continuous data stream (such as the exchange of messages, tweets and sensor-based log files). The algorithm for the calculation of this measure makes two assumptions: a fast response is required, and memory is both limited and infinitely smaller than the size of the data stream. In addition, to face the great computational power required to process massive data stream, we present also a parallel implementation of the approximate TF–IDF calculation using Graphical Processing Units (GPUs). This implementation of the algorithm was tested on generated and real data stream and was able to capture the most frequent terms. Our results demonstrate that the approximate version of the TF–IDF measure performs at a level that is comparable to the solution of the precise TF–IDF measure. 2014 Elsevier Inc. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Topic Discovery Algorithm Based on Mutual Information and Label Clustering under Dynamic Social Networks

In recent years, topic detection has become a hot research point of the social network, which can be very good to find the key factors from the massive information and thus discover the topics. The traditional label propagation-based topic discovery algorithm (LPA) is widely concerned because of its approximate linear time complexity and there is no need to define the target function. However, ...

متن کامل

CitationAS: A Summary Generation Tool Based on Clustering of Retrieved Citation Content

Usually, if researchers want to understand research status of any field, they need to browse a great number of related academic literatures. Luckily, in order to work more efficiently, automatic documents summarization can be applied for taking a glance at specific scientific topics. In this paper, we focus on summary generation of citation content. An automatic tool named CitationAS is built, ...

متن کامل

Clustering Massive Text Data Streams by Semantic Smoothing Model

Clustering text data streams is an important issue in data mining community and has a number of applications such as news group filtering, text crawling, document organization and topic detection and tracing etc. However, most methods are similarity-based approaches and use the TF*IDF scheme to represent the semantics of text data and often lead to poor clustering quality. In this paper, we fir...

متن کامل

Hybrid Approach for Single Text Document Summarization Using Statistical and Sentiment Features

Summarization is a way to represent same information in concise way with equal sense. This can be categorized in two type Abstractive and Extractive type. Our work is focused around Extractive summarization. A generic approach to extractive summarization is to consider sentence as an entity, score each sentence based on some indicative features to ascertain the quality of sentence for inclusion...

متن کامل

Entity Recognition in a Web Based Join Structure

Given a document, the task of Entity Recognition is to identify predefined entities such as person names, products, or locations in this document. With a potentially large dictionary, this entity recognition problem transforms into a Dictionary-based Membership Checking problem called Approximate Membership Extraction (AME) which aims at finding all possible substrings from a document that matc...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Inf. Sci.

دوره 292  شماره 

صفحات  -

تاریخ انتشار 2015